Challenges of PetaScale
computing and beyond 


Application scalability 

Programming models 

Programming env. and debuggers 
Numerical & communication libraries 
Data storage and archive 


All these topics are related to Fault Tolerance 


Fault tolerance should be considered as a major issue 
and integrated early in the design of a PetaScale system 


Some issues related to fault
tolerance of parallel applications


* Why is fault tolerance challenging?
* What are the main reasons behind failures?
* What is the general trend in failure rates?
* Rollback recovery or replication?
* Rollback recovery protocol?
* Optimization of rollback recovery?
* Alternatives to rollback recovery?

Why Fault Tolerance in large 
scale systems is challenging 


* Difficult (scalability), also for the programmer

* Not always considered as a first-class issue

* Has to live with existing (old) software

* Has to work on a large variety of hardware

* New results need strong efforts (old research
discipline)

* Etc.


* Very expensive. Why? --> Let's see...


Classical approach for FT:
Checkpoint-Restart


[Figure: typical "Balanced System" for PetaScale computers (balanced system approach: computing speed, network speed in Gigabytes/sec, archival)]

* Compute nodes, total memory: 100-200 TB
* Network to I/O nodes and parallel file system: 40 to 200 GB/s
* Checkpoint time: 1000 sec. < Ckpt < 2500 sec.
* Ex: LLNL BG/L (500 TF): ckpt time 20 min.

Cost of Checkpointing 


* Uses all resources: CPUs (if compression), memory, network and disk -->
may be one of the most resource-consuming actions in HPC! (this class of
machines draws more than 1 MW!)


* Slows down the execution: even if the checkpoint is non-blocking,
computations and communications are slowed down significantly


* Synchronization & message flush: without specific hardware, sync may
take 1 minute or more --> for a 1 Petaflops machine that's equivalent to a
cluster of 730 CPUs for 1 day. If you sync 10x per day --> you lose a
cluster of 7300 CPUs!


* 1 h of a PetaScale system costs $2K-$4K --> How many checkpoints can
you afford per day? (estimation based on a cost of $50K-$100K per day
for a PetaScale machine)


--> Challenge: Reduce the cost of checkpointing
or imagine novel FT techniques
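
The cost figures above can be reproduced with a few lines of arithmetic. This is a rough sketch, assuming the machine does no useful work while it synchronizes or checkpoints. The CPU count is an assumption: the slide does not give one, and about 1.05 million CPUs is simply the value for which a 1 minute sync equals 730 CPU-days.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def cpu_days_lost(total_cpus: int, sync_seconds: float, syncs_per_day: int = 1) -> float:
    """CPU-days of work lost per day while the whole machine waits in a sync."""
    return total_cpus * sync_seconds * syncs_per_day / SECONDS_PER_DAY


def checkpoint_cost_per_day(daily_cost_usd: float, ckpt_seconds: float,
                            ckpts_per_day: int) -> float:
    """Dollar value of the machine time spent checkpointing in one day."""
    busy_seconds = ckpt_seconds * ckpts_per_day
    if busy_seconds > SECONDS_PER_DAY:
        raise ValueError("checkpoints do not fit in one day")
    return daily_cost_usd * busy_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    cpus = 1_051_200  # assumed CPU count, see the note above
    print(cpu_days_lost(cpus, 60))      # one 1-minute sync per day
    print(cpu_days_lost(cpus, 60, 10))  # ten syncs per day
    # Three 20-minute checkpoints per day on a $100K/day machine.
    print(round(checkpoint_cost_per_day(100_000, 20 * 60, 3)))
```

Three 20-minute checkpoints add up to one hour per day, so their cost is one hour of machine time.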


Some issues related to fault
tolerance of parallel applications


* What are the main reasons behind failures?
* What is the general trend in failure rates?
* Rollback recovery or replication?
* Rollback recovery protocol?
* Optimization of rollback recovery?
* Alternatives to rollback recovery?


Classical Understanding Approach: 
Failure logs 


(I) High-level system information | (II) Information per node category

* Failure logs from LANL, NERSC, PNNL,
ask.com, SNL, LLNL, etc.
Ex: LANL released root cause logs for:

Analysis of failure logs


* In 2005 (Ph.D. of Charng-Da Lu): "Software halts account for the most
number of outages (59-84 percent), and take the shortest time to repair (0.6-1.5
hours). Hardware problems, albeit rarer, need 6.3-100.7 hours on the average to
solve."

Conclusion 1: Both hardware and software failures have to be considered
Conclusion 2 (Oliner): logging tools fail too, some key info is missing, better
filtering (correlation) is needed


--> Challenge: better logging and analysis tools


Some issues related to fault
tolerance of parallel applications


* What is the general trend in failure rates?
* Rollback recovery or replication?
* Rollback recovery protocol?
* Optimization of rollback recovery?
* Alternatives to rollback recovery?


Failure rate and #sockets 


In the Top500, machine performance doubles every year
--> faster than Moore's law and the increase of #cores in CPUs


If we consider #cores x 2 every 18, 24 and 30 months:
The # of sockets is increasing rapidly (100K in 2012)
The MTBF of CPUs is not likely to improve in the near future
We will reach the 1 h wall as soon as 2011-2013
--> Challenge: Improve ckpt time or find alternative
approaches
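
The link between socket count and failure rate can be sketched with the usual first-order model: if sockets fail independently at a constant rate, the system MTBF is the socket MTBF divided by the number of sockets. The 10-year socket MTBF below is a hypothetical value chosen for illustration, not a figure from the slides.

```python
HOURS_PER_YEAR = 365 * 24


def system_mtbf_hours(socket_mtbf_years: float, n_sockets: int) -> float:
    """System MTBF in hours, assuming independent, constant-rate socket failures."""
    if n_sockets <= 0:
        raise ValueError("n_sockets must be positive")
    return socket_mtbf_years * HOURS_PER_YEAR / n_sockets


def sockets_at_wall(socket_mtbf_years: float, wall_hours: float = 1.0) -> float:
    """Socket count at which the system MTBF drops to `wall_hours`."""
    return socket_mtbf_years * HOURS_PER_YEAR / wall_hours


if __name__ == "__main__":
    socket_mtbf_years = 10.0  # hypothetical per-socket MTBF
    for n_sockets in (10_000, 50_000, 100_000):
        mtbf = system_mtbf_hours(socket_mtbf_years, n_sockets)
        print(f"{n_sockets} sockets: system MTBF {mtbf:.2f} h")
```

Under this model the system MTBF shrinks linearly with the socket count, so a checkpoint that takes 20 to 40 minutes consumes a growing share of the time between failures.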
Some issues related to fault
tolerance of parallel applications


* Rollback recovery, replication, migration?
* Rollback recovery protocol?
* Optimization of rollback recovery?
* Alternatives to rollback recovery?


Does replication make sense?


* Distributed, parallel simulation as 
transaction (message) processing 


[Figure: excerpt from the "A NonStop Kernel" paper]


Classic reliable computing: process-pairs


* Automation possible w/ hypervisors 
Deliver all incoming messages to both 
Match outgoing messages from both 


* 50% hardware overhead


* Slowdown of pair synch
No stopping to checkpoint


* Less pressure on storage 
bandwidth except for 
visualization checkpoints 
[Figure: application utilization (%); paper excerpt describing the process-pair mechanism for fault-tolerant resource access]

* Replication (process pairs) can tolerate only 1 simultaneous failure


* Full replication is probably too expensive (double the Hardware 
AND power consumption) 


--> Challenge: Replication (N faults) only for nodes
that are likely to fail during the execution
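
The process-pair idea (deliver all incoming messages to both replicas, match outgoing messages from both) can be illustrated with a toy model. This is a minimal sketch, assuming deterministic replicas; the `Replica` and `ProcessPair` classes are hypothetical names, not part of any real system.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Deterministic process: its state is the running sum of received values."""
    total: int = 0
    alive: bool = True

    def handle(self, value: int) -> int:
        self.total += value
        return self.total  # the outgoing message


class ProcessPair:
    """Two replicas fed with the same inputs; outputs are matched before release."""

    def __init__(self) -> None:
        self.replicas = [Replica(), Replica()]

    def fail(self, index: int) -> None:
        self.replicas[index].alive = False

    def deliver(self, value: int) -> int:
        # Deliver the incoming message to every live replica.
        outputs = [r.handle(value) for r in self.replicas if r.alive]
        if not outputs:
            raise RuntimeError("both replicas failed")
        # Match outgoing messages: live replicas must agree.
        if len(set(outputs)) != 1:
            raise RuntimeError("replicas diverged")
        return outputs[0]


if __name__ == "__main__":
    pair = ProcessPair()
    print(pair.deliver(3))
    pair.fail(0)            # one failure is tolerated without rollback
    print(pair.deliver(4))
```

The pair keeps running after one failure, but a second failure stops it, which is the "only 1 simultaneous failure" limit noted above.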


Some issues related to fault
tolerance of parallel applications


* Rollback recovery protocol?
* Optimization of rollback recovery?
* Alternatives to rollback recovery?


Roll-Back Recovery Protocols:
Which one to choose from?


[Figure: processes P0, P1, P2 with a consistent and a non-consistent set of checkpoints]
Where to checkpoint to ensure a consistent rollback-recovery?


[Figure: classification of protocols; recoverable labels: Automatic, Coordinated, Cocheck, "Optimistic recovery in distributed systems", Manetho]
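
A set of checkpoints is consistent when no message is recorded as received by a process whose sender has not yet recorded sending it. The check can be sketched as follows, assuming each process numbers its own events and a checkpoint is identified by the index of the last event it covers (the `Message` type and function names are illustrative):

```python
from typing import Dict, List, NamedTuple


class Message(NamedTuple):
    sender: int      # sending process id
    send_time: int   # event index on the sender's local clock
    receiver: int    # receiving process id
    recv_time: int   # event index on the receiver's local clock


def orphan_messages(checkpoints: Dict[int, int], messages: List[Message]) -> List[Message]:
    """Messages received before the receiver's checkpoint but sent after the sender's."""
    return [m for m in messages
            if m.recv_time <= checkpoints[m.receiver]
            and m.send_time > checkpoints[m.sender]]


def is_consistent(checkpoints: Dict[int, int], messages: List[Message]) -> bool:
    return not orphan_messages(checkpoints, messages)


if __name__ == "__main__":
    messages = [Message(0, 2, 1, 3), Message(1, 5, 2, 4)]
    print(is_consistent({0: 3, 1: 4, 2: 5}, messages))  # P1 checkpoints before its send
    print(is_consistent({0: 3, 1: 5, 2: 5}, messages))  # P1 checkpoints after its send
```

In the first case P2's checkpoint contains a message that P1's checkpoint never sent, so restarting from it would leave an orphan message; moving P1's checkpoint after the send removes the problem.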
Latest results in RR protocols


* Fully non-blocking coordinated ckpt. (comp. & comm. continue)
* Improved message logging
* Zoning

Ex: improved sender-based message logging in MPI-V
[Figure legend: Open MPI-V no Sender-Based, Open MPI-V with Sender-Based, IPoMX, MPICH-V2]

--> Coordinated and message logging protocols have
been improved --> not much to gain in this domain


Reducing checkpoint size


Memory footprint


* System-based
incremental
checkpointing


* Compiler-assisted application-level checkpoint (Cornell)
* Application checkpointing library (CEA)

* User-specific codes
* Compression


* Inspector-Executor (trace-based)
checkpoint (INRIA study)

Ex: DGETRF (max gain 20% over IC)
--> Challenge: Reducing checkpoint size is probably
the most difficult problem but leads to strong returns
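
Incremental checkpointing saves only the parts of memory that changed since the previous checkpoint. System-level implementations usually detect changes with page protection or dirty bits. The sketch below swaps in block hashing as a stand-in that runs anywhere, and assumes a fixed-size memory image (block size and function names are illustrative):

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 4096  # bytes per block; a page-sized granularity chosen for illustration


def block_digests(memory: bytes) -> List[bytes]:
    """One SHA-256 digest per block of the memory image."""
    return [hashlib.sha256(memory[i:i + BLOCK_SIZE]).digest()
            for i in range(0, len(memory), BLOCK_SIZE)]


def incremental_checkpoint(memory: bytes,
                           previous: List[bytes]) -> Tuple[Dict[int, bytes], List[bytes]]:
    """Return the blocks that differ from the previous checkpoint, plus new digests."""
    digests = block_digests(memory)
    changed = {}
    for index, digest in enumerate(digests):
        if index >= len(previous) or previous[index] != digest:
            start = index * BLOCK_SIZE
            changed[index] = memory[start:start + BLOCK_SIZE]
    return changed, digests


def restore(base: bytes, deltas: List[Dict[int, bytes]]) -> bytes:
    """Rebuild the image from a full checkpoint and a list of increments (oldest first)."""
    blocks = [base[i:i + BLOCK_SIZE] for i in range(0, len(base), BLOCK_SIZE)]
    for delta in deltas:
        for index, data in delta.items():
            blocks[index] = data  # assumes the image never grows
    return b"".join(blocks)


if __name__ == "__main__":
    base = bytes(4 * BLOCK_SIZE)      # full checkpoint of a zeroed 16 KiB image
    image = bytearray(base)
    image[5000] = 1                   # the application touches one byte in block 1
    delta, _ = incremental_checkpoint(bytes(image), block_digests(base))
    print(sorted(delta), len(delta[1]))
```

Only one block out of four is saved here; the gain in a real application depends on how much of the memory footprint is rewritten between checkpoints.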


Some issues related to fault
tolerance of parallel applications


* Optimization of rollback recovery?
* Alternatives to rollback recovery?


Store ckpt on SSD (Flash mem.)


[Figure: compute nodes with SSD (total memory: 100-200 TB), 40 to 200 GB/s to I/O nodes, parallel file system (1 to 2 PB)]


* Current practice --> checkpoint on local disk (1 min.) and then move
ckpt images asynchronously to persistent storage
--> Checkpoint still needs 20-40 minutes (does not solve the 30 min. wall!)


* Recent proposal: flash memory (SSD) in nodes or attached to network
--> Increases the cost of the machine (100 TB of flash memory) + increases
power consumption + need to store more than 1 ckpt image on remote
nodes (if SSD on nodes) OR adds a large # of components to the system.


--> Challenge: How to integrate the SSD tech. while
keeping reasonable cost and #failures?


Diskless Checkpointing


[Figure: redundancy encoding algorithm and recovery]


* Needs spare nodes --> increases the overall cost and #failures

* Tolerates only N simultaneous faults (depends on coding)

* Needs a global synchronization & freeze during ckpt.

* Needs very fast encoding & reduction operations

* Requires an automatic system-level ckpt protocol or program modifications
* Works with incremental ckpt.


--> Challenge: Remove the need for synchronization
and global rollback
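
The encoding step of diskless checkpointing can be illustrated with the simplest code, XOR parity, which tolerates one lost node. This is a sketch under stated assumptions: all in-memory checkpoints have the same length, a single spare node holds the parity, and the function names are illustrative. Tolerating N faults requires a stronger code than XOR.

```python
from functools import reduce
from typing import List


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(checkpoints: List[bytes]) -> bytes:
    """Parity kept on the spare node: XOR (a reduction) of all node checkpoints."""
    if len({len(c) for c in checkpoints}) != 1:
        raise ValueError("need at least one checkpoint, all of the same length")
    return reduce(xor_bytes, checkpoints)


def recover(surviving: List[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing checkpoint from the survivors and the parity."""
    return reduce(xor_bytes, surviving, parity)


if __name__ == "__main__":
    checkpoints = [b"node0", b"node1", b"node2"]
    parity = encode_parity(checkpoints)
    # Node 1 fails: its checkpoint is rebuilt from nodes 0 and 2 plus the parity.
    print(recover([checkpoints[0], checkpoints[2]], parity))
```

The parity is only valid if all checkpoints belong to the same global state, which is why the scheme needs the global synchronization and freeze listed above.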


Some issues related to fault
tolerance of parallel applications 


*Where are the opportunities? 


Wrapping-up 


Fault tolerance should be considered as a major issue and 
integrated early in the design of a PetaScale system 


Coordinated and message logging protocols have been 
improved --> not much to gain in this domain 


Challenges: 

* Reduce the cost of checkpointing (checkpoint size & time)
* Design better logging and analysis tools

* Design less expensive replication approaches

* Integrate Flash mem. tech. & keep low cost and #failures
* Remove global synch. and rollback in diskless ckpt.


Opportunities: we can still try to
optimize...

* Remove sources of non-determinism
* Checkpoint only at selected moments
* Limit fault types (ex: network --> MPI-FT - network)


* Restart at the latest checkpoint (ex: optimistic protocols --
subject to domino effect)


* Save all data “simultaneously” (relax global coordination)
--> needs message logging between ckpt regions.


* Specific FT approach for specific patterns (ex: most
communication patterns are deterministic)


Opportunities: other FT paradigms? 


* Almost all current research is based on variations of the original
Chandy-Lamport algorithm for "determination of consistent global
states"

--> This algorithm does not consider all types of failures (Byzantine)


--> It is mostly used to restart the execution in the same conditions
as before the failure


--> It concerns applications that cannot "survive" failures

(Al Geist, Christian Engelmann)
Ex: Meshless iterative methods


+ chaotic relaxation


Meshless formulation of 2-D finite difference application


* Other distributed system paradigms consider failures as “normal”
events: Speculative Multi-Threading, Software Transactional
Memory, Self-Stabilization
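
The idea behind chaotic relaxation is that an iterative solver can still converge when some updates are lost or use stale values. The sketch below is a simplified stand-in, not the meshless method itself: it relaxes a 1-D Laplace problem on a regular grid and randomly drops updates to mimic failed nodes (grid size, drop probability and seed are arbitrary choices).

```python
import random
from typing import List


def chaotic_relaxation(n_interior: int = 10, sweeps: int = 3000,
                       drop_probability: float = 0.2, seed: int = 0) -> List[float]:
    """Relax u'' = 0 with u(0) = 0 and u(1) = 1, losing some updates at random."""
    rng = random.Random(seed)
    u = [0.0] * (n_interior + 2)
    u[-1] = 1.0  # boundary conditions: u[0] = 0, u[-1] = 1
    for _ in range(sweeps):
        for i in range(1, n_interior + 1):
            if rng.random() < drop_probability:
                continue  # this update is lost, the point keeps its old value
            u[i] = 0.5 * (u[i - 1] + u[i + 1])
    return u


def max_error(u: List[float]) -> float:
    """Distance to the exact solution, the straight line u(x) = x."""
    n = len(u) - 1
    return max(abs(value - i / n) for i, value in enumerate(u))


if __name__ == "__main__":
    print(max_error(chaotic_relaxation()) < 1e-6)
```

No checkpoint or rollback is involved: lost updates only slow convergence, which is what makes this class of methods attractive as an alternative FT paradigm.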


Conclusion: need for short
and long term research


* Novel computing methods
(inherently tolerant to failures)


* Novel FT paradigm
  * STM, Self-Stabilization, Speculation?
* Specific hardware?


* Optimizations:
  * Checkpoint storage & transfers
  (additional local disk, Flash mem.)
  * Reduce checkpoint size
  * Proactive migration (sensors)
  * Replication?


Definitely, we would be interested in collaborating with
Japan top 4 ;-) --> what about FT and Language?


